11 research outputs found

    Inference and Evaluation of the Multinomial Mixture Model for Text Clustering

    Full text link
    In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification, or information extraction. The model considered in this contribution consists of a mixture of multinomial distributions over the word counts, each component corresponding to a different theme. We present and contrast various estimation procedures, which apply in both supervised and unsupervised contexts. In supervised learning, this work suggests a criterion for evaluating the posterior odds of new documents which is more statistically sound than the "naive Bayes" approach. In an unsupervised context, we propose measures to set up a systematic evaluation framework and begin by examining the Expectation-Maximization (EM) algorithm as the basic tool for inference. We discuss the importance of initialization and the influence of other features, such as the smoothing strategy or the size of the vocabulary, thereby illustrating the difficulties incurred by the high dimensionality of the parameter space. We also propose a heuristic algorithm based on iterative EM with vocabulary reduction to address this problem. Using the fact that the latent variables can be analytically integrated out, we finally show that the Gibbs sampling algorithm is tractable and compares favorably to the basic expectation-maximization approach.
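    The EM procedure the abstract refers to, for a mixture of multinomials with one latent theme per document, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the additive smoothing constant `alpha` and the Dirichlet initialization are assumptions.

```python
import numpy as np

def em_multinomial_mixture(counts, k, n_iter=50, alpha=1e-2, seed=0):
    """EM for a mixture of multinomial distributions over word counts.

    counts : (n_docs, vocab_size) array of word counts.
    k      : number of themes (mixture components).
    alpha  : additive smoothing constant (an assumption; the paper
             discusses smoothing strategies without fixing one).
    Returns mixing weights pi, per-theme word distributions phi,
    and the responsibilities r (one probability vector per document).
    """
    rng = np.random.default_rng(seed)
    n, v = counts.shape
    pi = np.full(k, 1.0 / k)
    phi = rng.dirichlet(np.ones(v), size=k)          # (k, v), random init
    for _ in range(n_iter):
        # E-step: log p(theme = j | doc), computed in the log domain
        log_r = np.log(pi) + counts @ np.log(phi).T  # (n, k)
        log_r -= log_r.max(axis=1, keepdims=True)    # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights and smoothed word distributions
        pi = r.mean(axis=0)
        phi = r.T @ counts + alpha                   # (k, v)
        phi /= phi.sum(axis=1, keepdims=True)
    return pi, phi, r
```

    Keeping the E-step in the log domain avoids underflow for long documents, which is one concrete face of the high-dimensionality difficulties the abstract mentions.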

    Online Evaluation of Coreference Resolution

    Get PDF
    Conference paper with peer-reviewed proceedings; international audience. This paper presents the design of an online evaluation service for coreference resolution in texts. We argue that coreference, as an equivalence relation between referring expressions (RE) in texts, should be properly distinguished from anaphora and therefore has to be evaluated separately. The annotation model for coreference is based on links between REs. The program presented in this article compares two such annotations, which may be the output of coreference resolution tools or of human judgement. In order to evaluate the agreement between the two annotations, the evaluator first converts the input annotation format into a pivot format, then abstracts equivalence classes from the links and provides five scores representing in different ways the similarity between the two partitions: MUC, B3, Kappa, Core-discourse-entity, and Mutual-information. Although we consider that the identification of REs (i.e. the elements of the partition) should not be part of coreference resolution properly speaking, we propose several solutions for the frequent case when the input files do not agree on the elements of the text to consider as REs.
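    The step of abstracting equivalence classes from RE links, together with one of the five scores (MUC recall), can be sketched as follows. The function names and the input format (links as pairs of RE identifiers) are illustrative assumptions, not the service's actual interface or pivot format.

```python
def classes_from_links(links):
    """Abstract equivalence classes from coreference links (pairs of RE ids),
    using union-find with path halving."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in links:
        parent[find(a)] = find(b)
    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return list(classes.values())

def muc_recall(key_classes, response_classes):
    """MUC recall: for each key class S, count the partitions of S induced
    by the response (intersecting response classes plus unlinked mentions);
    recall = sum(|S| - partitions) / sum(|S| - 1)."""
    covered = set().union(*response_classes) if response_classes else set()
    num = den = 0
    for s in key_classes:
        parts = sum(1 for r in response_classes if s & r) + len(s - covered)
        num += len(s) - parts
        den += len(s) - 1
    return num / den if den else 1.0
```

    Swapping the two arguments of `muc_recall` yields MUC precision, which is the usual way the link-based score is computed.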

    Méthodes probabilistes pour l'analyse exploratoire de données textuelles (Probabilistic methods for the exploratory analysis of textual data)

    No full text
    In this thesis, we investigate the use of a probabilistic model for the unsupervised clustering of text collections. We focus in particular on the multinomial mixture model, with one latent theme variable per document. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification, or information extraction, and also serves for topic tracking and summarization. Several probabilistic clustering models that build "soft" theme-document associations have recently been proposed. These models make it possible to compute, for each document, a probability vector whose values can be interpreted as the strength of the association between the document and each cluster. As such, these vectors can also serve to project texts into a lower-dimensional "semantic" space. These models, however, pose non-trivial estimation problems, which are aggravated by the very high dimensionality of the parameter space, in particular the size of the vocabulary. The contribution of this study is twofold. First, we present and contrast various inference algorithms, some of them original, for estimating the multinomial mixture model. Second, we propose a systematic evaluation of the performance of these algorithms, thereby defining a framework for assessing the quality of unsupervised text clustering methods. The good results obtained in comparison with other classical algorithms demonstrate, in our opinion, the relevance of this simple mixture model for corpora composed mainly of monothematic documents.

    Which granularity to bootstrap a multilingual method of document alignment: character N-grams or word N-grams?

    Get PDF
    International audience. This article tackles multilingual automatic alignment. Alignment refers to the process by which segments that are translations of one another are automatically matched. Instead of comparing only pairs of languages at the sentence level, as is usually done to mirror the human translation process, the computer is used here for its capacity to infer semantic alignment from a collection of texts that are translations of the same content. The corpus contains press releases from Europa, the European Community website, available in up to 23 languages. The alignment process takes advantage of frequency similarity between the different linguistic versions of a document by computing matching features for each repeated string in all versions, in order to find reliable anchors for linking versions. This raises the question of the best granularity for bringing out semantic equivalences when comparing two linguistic versions: character N-grams or word N-grams. Alignment systems are traditionally based on word N-gram splitting. The morphological variety of languages, even within a single linguistic family, quickly shows that word granularity is inadequate for a widely multilingual system, i.e. a language-independent system able to handle inflectional as well as positional languages. Instead, when starting from a multilingual collection to focus on pairs of texts, we argue that character N-gram alignment is more efficient than word N-gram alignment.
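    One way to obtain repeated-string anchors at the character level can be sketched as follows. This is a hedged illustration: the n-gram length and the restriction to n-grams occurring exactly once in each version are assumptions, not the paper's exact matching features.

```python
from collections import Counter

def hapax_ngrams(text, n=5):
    """Character n-grams occurring exactly once in the text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g for g, c in grams.items() if c == 1}

def alignment_anchors(src, tgt, n=5):
    """Candidate anchors: character n-grams unique in both versions.

    Uniqueness on both sides makes the match unambiguous, which is what
    makes such strings reliable anchors for linking two linguistic versions.
    Returns (position_in_src, position_in_tgt, ngram) triples, sorted.
    """
    shared = hapax_ngrams(src, n) & hapax_ngrams(tgt, n)
    return sorted((src.index(g), tgt.index(g), g) for g in shared)
```

    On related languages, shared proper nouns and cognates (e.g. "Commission") typically surface as anchors even without any word segmentation, which is the point of the character-level granularity.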

    Multilingual corpus allows automatic detection of parallel areas in pairs of documents and semantic matching of lemmas

    No full text

    Détection de zones parallèles à l'intérieur de multi-documents pour l'alignement multilingue (Detection of parallel areas within multi-documents for multilingual alignment)

    No full text
    National audience. This article addresses a central issue in automatic alignment: diagnosing the parallelism of the documents to align. Previous research has concentrated on the analysis of documents that are parallel by nature, such as corpora of regulations, technical documents, or isolated sentences. The inversion and deletion/addition phenomena that may exist between different versions of a document have thus often been overlooked. By contrast, we propose a method for diagnosing parallel areas in context, which allows the detection of deletions or inversions between the documents to align. This original method frees itself from the notions of word and sentence, and takes the physical formatting of the text into account. Its implementation is based on similarities in the distribution of repeated character strings across the different documents; these distributions are represented as matrices, and the parallel areas are identified using image-processing methods.
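    The matrix representation described above can be sketched as a character-level dot-plot, in which runs of ones along diagonals correspond to parallel areas, while inversions and deletions show up as displaced or missing diagonals. The n-gram length and the binary encoding are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

def dotplot(a, b, n=4):
    """Binary matrix M[i, j] = 1 when texts a and b carry the same
    character n-gram at positions i and j; parallel areas appear as
    diagonal runs, which image-processing methods can then extract."""
    grams_b = {}
    for j in range(len(b) - n + 1):
        grams_b.setdefault(b[j:j + n], []).append(j)
    m = np.zeros((len(a) - n + 1, len(b) - n + 1), dtype=np.uint8)
    for i in range(len(a) - n + 1):
        for j in grams_b.get(a[i:i + n], []):
            m[i, j] = 1
    return m
```

    On such a matrix, standard line-detection or morphological filtering from an image-processing toolbox is enough to isolate the diagonal segments, i.e. the parallel zones.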